Geo-Spatial Analysis of Overcrowded Busses in Metro Vancouver

By Francis Lee

Outcomes

Introduction

TransLink is Metro Vancouver's transportation network. As an essential public service, TransLink conducts surveys every year to measure its progress towards its Customer Experience Action Plan, in which it says it aims to increase accessibility, cleanliness, and comfort. With the COVID-19 pandemic, it has become abundantly clear that an overcrowded bus can heavily affect all three of those factors. This project aims to show which areas in Metro Vancouver experience the most boardings and the most overcrowding. It also asks: is overcrowding just a matter of how many people use each line, or are busses not being distributed evenly across bus lines?

This project can also be found on https://github.com/farnyp/econ323proj

Data Sourcing and Methodology

All data sources collected for this project are provided publicly by TransLink. I will be using three datasets. Two of them come from the 2019 TransLink Service Performance Review: "Bus: Annual Indicators by Route and Year" and "Bus: Key Characteristics by Route". The third is the TransLink Transit GIS Data dated 09 August 2019.

This project will disregard all other major transportation routes such as SkyTrain, West Coast Express, and SeaBus and focus only on bus routes around the Metro Vancouver area. I will be performing two regression analyses, using a rudimentary linear regression model and a neural network. Afterwards, we will focus on visualizing the geospatial data of bus routes.

The first dataset contains a large number of variables, most of which I will not be using for the purpose of this project. Therefore, I will only be keeping a few quantitative variables: the annual passenger boardings, average weekday boardings, and percentage of trips that were overcrowded.

To properly use this dataset, I will need to fill all null values with 0, since a missing value in these variables means that there were no boardings.
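A minimal sketch of this cleaning step, assuming pandas and a small made-up table. The column names (`line_no`, `annual_boardings`, `avg_weekday_boardings`, `pct_overcrowded`, `service_hours`) and all values are hypothetical stand-ins, not the headers or figures in the TransLink file:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the "Annual Indicators by Route and Year" table.
# Column names and values are hypothetical, not the real TransLink data.
raw = pd.DataFrame({
    "line_no": ["002", "002", "099"],
    "Year": [2018, 2019, 2019],
    "annual_boardings": [5_000_000, np.nan, 17_000_000],
    "avg_weekday_boardings": [15_000, np.nan, 55_000],
    "pct_overcrowded": [0.04, np.nan, 0.12],
    "service_hours": [90_000, 91_000, 120_000],  # a column we do not need
})

keep = ["line_no", "annual_boardings", "avg_weekday_boardings", "pct_overcrowded"]

# Keep only the variables of interest, then treat missing values as zero.
routes = raw[keep].fillna(0)
print(routes)
```

Filling with 0 is only appropriate under the assumption stated above, that a missing value means no boardings. If a missing value instead meant "not measured", dropping those rows would be the safer choice.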

My next step is to load the second dataset, which includes the name and sub-region of each line. Then, when I merge the two datasets together, I will have a very tidy and descriptive dataset that is very pleasing and easy to read.
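The merge can be sketched as follows, assuming both tables share a route identifier. Here it is called `line_no`, as in the grouping step later on; the other column names and the route names are hypothetical:

```python
import pandas as pd

# Hypothetical per-route indicators (first dataset after cleaning).
indicators = pd.DataFrame({
    "line_no": ["002", "099", "240"],
    "avg_weekday_boardings": [15_000, 55_000, 9_000],
    "pct_overcrowded": [0.04, 0.12, 0.09],
})

# Hypothetical route characteristics (second dataset).
characteristics = pd.DataFrame({
    "line_no": ["002", "099", "240"],
    "line_name": ["Route A", "Route B", "Route C"],
    "sub_region": ["Vancouver/UBC", "Vancouver/UBC", "North Shore"],
})

# A left join keeps every row of the indicators table, even if a route
# has no match in the characteristics table.
merged = indicators.merge(characteristics, on="line_no", how="left")
print(merged)
```

A left join is one reasonable choice here; an inner join would silently drop routes that are missing from the second table.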

This dataset brings a tear to my eye

Regression Analysis

For these next steps, I will be performing a linear regression and a neural network regression. I will be focusing my regression analysis on average weekday boardings vs percentage of trips with overcrowding.
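As an illustration of the linear step, here is a minimal sketch on synthetic data, using NumPy's least-squares line fit. The data, the variable names, and the choice of `numpy.polyfit` are assumptions made for the example, not the project's actual code or results:

```python
import numpy as np

# Synthetic data: average weekday boardings (x) and the share of
# overcrowded trips in percent (y). Values are made up for illustration.
rng = np.random.default_rng(0)
boardings = rng.uniform(0, 50_000, size=200)
pct_overcrowded = 2.0 + 0.0001 * boardings + rng.normal(0, 3, size=200)

# A degree-1 polyfit returns [slope, intercept] of the least-squares line.
slope, intercept = np.polyfit(boardings, pct_overcrowded, deg=1)

# R^2: share of the variance in y explained by the fitted line.
fitted = intercept + slope * boardings
ss_res = np.sum((pct_overcrowded - fitted) ** 2)
ss_tot = np.sum((pct_overcrowded - pct_overcrowded.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"intercept={intercept:.3f}, slope={slope:.6f}, R^2={r_squared:.3f}")
```

The intercept is the predicted overcrowding share when boardings are zero, and a low R^2 would indicate that boardings alone explain little of the variation in overcrowding.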

A humorous takeaway from this result is that even with average weekday boardings at 0, the model still predicts that 1.8% of trips are overcrowded. In seriousness, we can infer that these two variables may not be heavily correlated with each other. We can visualize this with a scatterplot below.

As we can see from this scatterplot, these two variables are not heavily correlated. Next, we will move on to regression using neural networks.

Using a scaled neural network model, we can see that the MSE is high, meaning the model is not a strong predictor. We can visualize the predicted values below.
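A scaled neural network regression could look roughly like the sketch below. It assumes scikit-learn, with `StandardScaler` and `MLPRegressor` chained in a pipeline; the layer sizes, the synthetic data, and the in-sample evaluation are all choices made for illustration and may differ from the model used in the project:

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for boardings (X) and overcrowding in percent (y).
rng = np.random.default_rng(0)
X = rng.uniform(0, 50_000, size=(200, 1))
y = 2.0 + 0.0001 * X[:, 0] + rng.normal(0, 3, size=200)

# Scale the input before it reaches the network; unscaled inputs in the
# tens of thousands tend to make gradient-based training unstable.
model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(16, 16), max_iter=2000, random_state=0),
)
model.fit(X, y)

# In-sample MSE, compared against always predicting the mean of y.
mse = mean_squared_error(y, model.predict(X))
baseline_mse = np.mean((y - y.mean()) ** 2)
print(f"MSE={mse:.3f}, baseline MSE={baseline_mse:.3f}")
```

MSE is expressed in squared units of the target, so whether it counts as "high" is easiest to judge against a baseline such as the variance of the target, as in the last lines of the sketch.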

Geospatial Analysis: Heatmapping the average weekday boardings and percentage of trips with overcrowding

This next section will focus on using geospatial data to visualize the difference between the two variables. Through this analysis, we can see which regions have the most passengers and the most overcrowding, and whether those areas roughly coincide.

We will start by aggregating the dataframe across years. In the code, I group by line_no because I have removed the Year variable; each line_no appears in several rows, one for each year from 2015 to 2019. Thus, grouping by line_no collapses the yearly observations into a single row per line.
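A small sketch of this aggregation, assuming pandas and assuming the yearly observations are combined with a mean (the text does not say which aggregate is used). Column names other than `line_no` and all values are hypothetical:

```python
import pandas as pd

# Two hypothetical lines with two yearly observations each; the Year
# column has already been dropped, as described above.
yearly = pd.DataFrame({
    "line_no": ["002", "002", "099", "099"],
    "avg_weekday_boardings": [14_000, 16_000, 50_000, 60_000],
    "pct_overcrowded": [0.03, 0.05, 0.10, 0.14],
})

# One row per line, averaging over the years that line appears in.
per_line = yearly.groupby("line_no", as_index=False).mean(numeric_only=True)
print(per_line)
```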

As a bonus, and to help show how the mean of each variable and the number of lines compare across sub-regions, we will group the data by Sub-Region of Primary Service.

What we can see from this is that while Vancouver/UBC has by far the highest number of average weekday boardings, it barely beats the Southeast region in overcrowding, with a difference of only 0.48%. Meanwhile, the Southeast region has nearly 9,000 fewer passengers on average, as well as 14 more bus routes.
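The sub-region summary can be sketched with a named aggregation in pandas. The column names and numbers below are hypothetical; only the sub-region labels are taken from the text:

```python
import pandas as pd

# Hypothetical one-row-per-line table.
per_line = pd.DataFrame({
    "line_no": ["002", "099", "240", "250"],
    "sub_region": ["Vancouver/UBC", "Vancouver/UBC", "North Shore", "North Shore"],
    "avg_weekday_boardings": [15_000, 55_000, 9_000, 7_000],
    "pct_overcrowded": [0.04, 0.12, 0.09, 0.11],
})

# Mean of each variable plus the number of distinct lines per sub-region.
by_region = per_line.groupby("sub_region").agg(
    mean_weekday_boardings=("avg_weekday_boardings", "mean"),
    mean_pct_overcrowded=("pct_overcrowded", "mean"),
    n_lines=("line_no", "nunique"),
)
print(by_region)
```

Reporting the line count next to the means makes it easier to tell whether a region's average is driven by a handful of busy routes or spread over many.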

We begin our mapping by loading our third and final dataset.

Next, we will create the data series required for the two heatmaps.
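One way to build such series, assuming the heat-map layer takes rows of latitude, longitude, and weight (folium's `HeatMap` plugin accepts this shape, though the text does not name the mapping library). The coordinates, column names, and the helper `heat_rows` are made up for illustration:

```python
import pandas as pd

# Hypothetical points along three routes with the two variables attached.
points = pd.DataFrame({
    "line_no": ["002", "099", "240"],
    "lat": [49.27, 49.26, 49.32],
    "lon": [-123.15, -123.20, -123.07],
    "avg_weekday_boardings": [15_000, 55_000, 9_000],
    "pct_overcrowded": [0.04, 0.12, 0.09],
})

def heat_rows(df, weight_col):
    """Return [lat, lon, weight] rows with weights rescaled to the 0..1 range."""
    weights = df[weight_col] / df[weight_col].max()
    return pd.concat([df["lat"], df["lon"], weights], axis=1).values.tolist()

boardings_heat = heat_rows(points, "avg_weekday_boardings")
overcrowding_heat = heat_rows(points, "pct_overcrowded")
print(boardings_heat[0])
```

Rescaling each variable by its own maximum puts both maps on a comparable intensity scale, which matters when the two are read side by side.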

Then, we create the basemap as well as the first map, where we show all the transit routes in Metro Vancouver.

Next, we will visualize how the regions differ in terms of average weekday boardings.

Finally, we will visualize the intensity of trips with overcrowding with another heat map.

Through the visualizations above, we can see that overcrowding is much more varied, with the North Shore and Southeast regions showing many more hotspots than they do in the map of average weekday boardings.

Conclusion

Based on the regression analysis and geospatial analysis, we can see that TransLink does not use Average Weekday Passenger Boardings as a strong indicator for dealing with overcrowding in busses. Interestingly, Vancouver sees less overcrowding relative to its average boardings, while the regions surrounding Vancouver see quite the opposite. Despite its limitations, this project leads me to recommend that TransLink allocate more infrastructure and busses to the regions surrounding Vancouver.

Bonus: Save the maps in .html

You can run the code below to save all the maps in .html format.